ABSTRACT

Multi-sample cluster analysis, the problem of grouping samples, is
studied from an information-theoretic viewpoint via Akaike's Information
Criterion (AIC). This criterion combines the maximum value of the likelihood
with the number of parameters used in achieving that value. The multi-sample
cluster problem is defined, and AIC is developed for this problem. The form of
AIC is derived in both the multivariate analysis of variance (MANOVA) model and
in the multivariate model with varying mean vectors and variance-covariance
matrices. Numerical examples are presented for AIC and another criterion
called w-square. The results demonstrate the utility of AIC in identifying the
best clustering alternatives.
1. Introduction

In this paper, we shall develop Akaike's Information Criterion (AIC) for
multi-sample cluster analysis with common and also with varying
variance-covariance matrices, since in practice the assumption of equal
variance-covariance matrices is often a rather dubious requirement.

The problem of multi-sample cluster analysis arises when we are given a
collection of samples (groups, treatments) to be clustered into homogeneous
groups.

Many practical situations require the presentation of multivariate data
from several structured samples for comparative purposes and the grouping of
the heterogeneous samples into homogeneous sets of samples. Thus, it is
reasonable to provide a practically useful statistical procedure that would use
some sort of statistical model to aid in comparisons of various collections of
samples and to identify homogeneous groups of samples, telling us which samples
should be clustered together and which should not.

Examples of multi-sample clustering situations are abundant in practice.
We shall give two of these examples later and illustrate them numerically.

The concept of multi-sample cluster analysis presented in this paper is
relatively new. It has not been definitively studied before, either using the
conventional simultaneous test procedures (STP's), which are based on inference
for the multivariate analysis of variance (MANOVA) model, or from an
information-theoretic viewpoint, which we shall adopt in this paper via Akaike's
Information Criterion (AIC).

*Presented by the first author as an Invited Paper, Special Session on
Cluster Analysis, 789th Meeting, American Mathematical Society, University of
Massachusetts, Amherst, MA, October 16-18, 1981.

Multivariate analysis of variance (MANOVA) is a widely used model for
comparing two or more multivariate samples with a common covariance matrix. In
this model, the likelihood ratio principle leads to Wilks' [17] lambda, or in
short Wilks' $\Lambda$ criterion, as the test statistic. It plays the same role in
multivariate analysis that the F-ratio statistic plays in the univariate case.

Often, however, the formal analyses involved in MANOVA are not revealing or
informative. Moreover, the test statistics used under this model are derived
under the assumption of equal covariance matrices. In the multivariate case the
assumption of equality of covariance matrices is certainly more hazardous: if
the covariance matrices are unequal, a bias occurs in the test for equality of
mean vectors. Therefore, if we have reason to doubt this assumption, we may
want to first test the equality of covariance matrices instead of immediately
leaping to the MANOVA hypothesis. This is an important option to use in
clustering groups or samples when we are not willing to assume equal covariance
matrices between the samples or groups in the multi-sample data.

Once the MANOVA hypothesis of equality of mean vectors is rejected at some
prescribed significance level $\alpha$, it is necessary to study in detail the
discrepancies between the null hypothesis and the data.

In the statistical literature, in the MANOVA case, there are a variety of
conventional multiple comparison procedures for studying the discrepancies
between the null hypothesis and the data. These test procedures are: Step-down
Methods, Union Intersection Tests, and Simultaneous Confidence Intervals. For
more details on these test procedures refer to Gabriel [7], Krishnaiah ([10],
[11]), Srivastava [16], and others.


As noted in Consul [4], the exact distributions of these conventional
test procedures are either unknown or are known for some particular cases
only. Moreover, the problem of finding the percentage points of these
statistics has become rather difficult. For these reasons, and for our purposes,
these test procedures have little practical use. Furthermore, they create
additional problems in terms of how to control the overall error rate $\alpha$, since
we can no longer use the same $\alpha$ to discover where the discrepancies between
the null hypothesis and the data might occur.

In the case of testing the equality of covariance matrices, we find
ourselves in the same situation as in the MANOVA model. For this problem also,
there are several test procedures in the statistical literature. For example,
one of the most commonly used tests is Box's M test, despite the fact that it
is very restrictive. For instance, Box's approximation seems to be good only
if each sample size, $n_g$, exceeds 20, and if the number of samples, K, and the
number of variables, p, do not exceed 5. It is also very expensive to compute
on a high speed computer, even on an IBM 370.

Once the hypothesis of equality of covariance matrices is rejected at
some prescribed significance level $\alpha$, then again it is necessary to study in
detail the discrepancies between the null hypothesis and the data.

Further reviewing the statistical literature, we see that there are no
conventional simultaneous test procedures (STP's) in this case for studying the
discrepancies between the null hypothesis and the data. One can perhaps
construct a sequential likelihood ratio type test, but as is mentioned in
Muirhead ([14], p. 296), the likelihood ratio test for the equality of
covariance matrices has the defect that, when the sample sizes
$n_1, n_2, \ldots, n_K$ are not all equal, it is biased. Therefore, in the
multi-sample cluster problem with varying parameters, carrying out a sequence
of likelihood ratio tests leaves much to be desired in identifying homogeneous
groups of samples.

More recently, however, in the statistical literature, we see a
likelihood-based approach, called the w-square criterion, given in Mardia
et al. [12] to aid in comparing various collections of samples, identifying
homogeneous groups of samples, and telling which should be clustered together.
For normal samples with equal covariance matrices, the w-square criterion is
defined by

(1.1)  $w_g^2 = \sum_{g=1}^{K} \sum_{\bar{X}_j \in C_g} n_j (\bar{X}_j - \bar{X}_g)' \hat{\Sigma}^{-1} (\bar{X}_j - \bar{X}_g),$

where

$C_g$ = the set of $\bar{X}_j$ assigned to the gth group, g = 1, 2, ..., K,

$\bar{X}_g$ = the weighted mean vector of the means in the gth group, or the
cluster set $C_g$ of groups,

$\hat{\Sigma} = W/(n-K)$, the pooled estimate of $\Sigma$,

$W = \sum_{g=1}^{K} A_g$ is the within-samples SSP matrix,

$n = \sum_{g=1}^{K} n_g$, and

K = the number of groups or samples to be clustered.

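If the matrix of Mahalanobis distances $D_{ij}$ given by

$D_{ij}^2 = (\bar{X}_i - \bar{X}_j)' \hat{\Sigma}^{-1} (\bar{X}_i - \bar{X}_j)$

is available, then for computational convenience, $w_g^2$ can be written as

(1.2)  $w_g^2 = \frac{1}{2} \sum_{g=1}^{K} N_g^{-1} \sum_{\bar{X}_i, \bar{X}_j \in C_g} n_i n_j D_{ij}^2,$

where

$N_g = \sum_{\bar{X}_j \in C_g} n_j.$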


Thus, when we are given multi-sample data and wish to cluster the samples,
we compute $w_g^2$ in (1.1) or (1.2) for some or all of the alternative groupings
of samples, and choose the minimum of $w_g^2$ to be the "best" alternative
clustering of samples. This is appropriate, since maximizing the likelihood
implies minimizing $w_g^2$.
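As a minimal sketch of the computational form (1.2), the following assumes that the matrix of squared Mahalanobis distances between sample means is already available. The function name `w_square`, the sample sizes, and the distance values are hypothetical and chosen only for illustration.

```python
def w_square(clusters, sizes, d2):
    """Evaluate (1.2): 1/2 * sum over clusters of N_g^{-1} * sum n_i n_j D_ij^2.

    clusters: list of lists of sample indices (one inner list per cluster)
    sizes:    sizes[j] is the sample size n_j
    d2:       d2[i][j] is the squared Mahalanobis distance between means i and j
    """
    total = 0.0
    for cluster in clusters:
        n_cluster = sum(sizes[j] for j in cluster)  # N_g
        pair_sum = sum(
            sizes[i] * sizes[j] * d2[i][j] for i in cluster for j in cluster
        )  # runs over ordered pairs, hence the factor 1/2 below
        total += 0.5 * pair_sum / n_cluster
    return total


# Hypothetical inputs: three samples of 50 observations each.
sizes = [50, 50, 50]
d2 = [
    [0.0, 4.0, 9.0],
    [4.0, 0.0, 1.0],
    [9.0, 1.0, 0.0],
]

singletons = w_square([[0], [1], [2]], sizes, d2)
merge_12 = w_square([[0], [1, 2]], sizes, d2)
merge_01 = w_square([[0, 1], [2]], sizes, d2)
merge_all = w_square([[0, 1, 2]], sizes, d2)
print(singletons, merge_12, merge_01, merge_all)
```

In this toy input the criterion is zero for the singleton clustering, and among the merged alternatives it is smallest when the two closest means are joined.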

Even though the $w_g^2$ criterion is a step forward in identifying homogeneous
groups of samples and evaluating multi-sample clusters, it has some
disadvantages. For instance, it does not make any allowance for m, the number of
parameters estimated within the model and the subsequent alternative submodels.
It is always zero when the groups or samples are clustered as singletons, as we
shall see later in Section 4. As it is given in (1.1), we can only work with
the $w_g^2$ criterion when we assume equal covariance matrices.

For the reasons stated above, and because of the problems encountered in the
conventional test procedures discussed earlier, in this paper we shall propose
Akaike's Information Criterion (AIC) as a new and unifying procedure for
evaluating multi-sample clusters, and use it to identify the best clustering
alternatives.


In 1971, Akaike first introduced an information criterion, referred to as
a Model Identification Criterion or Akaike's Information Criterion (AIC), for
the identification and comparison of statistical models in a class of competing
models with different numbers of parameters. It is defined by

(1.3)  $\mathrm{AIC} = (-2)\log_e(\text{maximized likelihood}) + 2\,(\text{number of free parameters within the model}).$

It was obtained by Akaike ([2], [3]) based on the recognition that the
classical method of maximum likelihood could be viewed as a method of
identification of a statistical model realized by maximizing an estimate of the
generalized entropy, or the expected log likelihood, of the model being fitted.
It estimates minus twice the expected log likelihood of the model whose
parameters are determined by the method of maximum likelihood. When several
competing models are being compared or fitted, AIC is a simple procedure which
measures the badness of fit or the discrepancy of the estimated model from the
true model when a set of data is given. The first term in (1.3) stands for the
penalty of badness of fit when the maximum likelihood estimators of the
parameters of the model are used. The second term in the definition of AIC, on
the other hand, stands for the penalty of increased unreliability, or
compensation for the bias in the first term, as a consequence of an increasing
number of parameters. If more parameters are used to describe the data, it is
natural to get a larger likelihood, possibly without improving the true
goodness of fit. AIC guards against this spurious improvement by penalizing
the use of additional parameters.

Thus, when there are several competing models, the parameters within the
models are estimated by the method of maximum likelihood and the AIC values are
computed and compared to find a model with the minimum value of AIC. This
procedure is called the minimum AIC procedure. The model with the minimum AIC
is called the minimum AIC estimate (MAICE) and is designated as the best model.
Thus, in applying AIC the emphasis is on comparing the "goodness of fit" of
various models with an allowance made for parsimony.
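The minimum AIC procedure can be sketched in a few lines. The model names, maximized log likelihoods, and parameter counts below are hypothetical placeholders, not results from this paper.

```python
def aic(max_log_likelihood, n_params):
    """Definition (1.3): -2 log(maximized likelihood) + 2 (free parameters)."""
    return -2.0 * max_log_likelihood + 2.0 * n_params


# Hypothetical competing models: (maximized log likelihood, parameter count).
candidates = {
    "model A": (-120.5, 3),
    "model B": (-118.9, 5),
    "model C": (-118.7, 9),
}

scores = {name: aic(loglik, m) for name, (loglik, m) in candidates.items()}
best = min(scores, key=scores.get)  # the MAICE

for name, value in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: AIC = {value:.1f}")
print("minimum AIC estimate:", best)
```

In this toy comparison the larger models reach slightly higher likelihoods, but the gain does not offset the 2m penalty, so the smallest model is selected.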

In  Section  2,  we  shall  define  the  general  multi-sample  cluster  problem. 

In Section 3, we shall derive the AIC procedure both for the multivariate
analysis of variance (MANOVA) model, and for the multivariate model with
varying covariance matrices. We shall, in Section 4, give different numerical
examples of multi-sample cluster analysis on different real data sets to
demonstrate our results from applying minimum AIC procedures in different
computer analyses. Finally, in Section 5, we shall present our conclusions and
discussion.


2. The Multi-Sample Cluster Problem


Suppose each individual, object, or case has been measured on p response
or outcome measures (dependent variables) simultaneously in K independent
groups or samples (factor levels). Let

(2.1)  $X\;(n \times p) = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_K \end{bmatrix}$

be a single data matrix of K groups or samples, where $X_g$ $(n_g \times p)$
represents the observations from the g-th group or sample, g = 1, 2, ..., K,
and $n = \sum_{g=1}^{K} n_g$. The goal of cluster analysis is to put the K groups
or samples into k homogeneous groups, samples, or classes, where k is unknown,
but $k \le K$.
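For a small number of samples, the clustering alternatives can be listed exhaustively. The sketch below enumerates every way of grouping K labelled samples into non-empty clusters. The helper `set_partitions` and the three labels are illustrative assumptions.

```python
def set_partitions(items):
    """Yield every partition of `items` into non-empty clusters."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing cluster in turn ...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ... or let it start a cluster of its own.
        yield [[first]] + partition


samples = ["S", "Ve", "Vi"]  # hypothetical labels for K = 3 samples
alternatives = list(set_partitions(samples))

counts = {}
for partition in alternatives:
    counts[len(partition)] = counts.get(len(partition), 0) + 1

for k in sorted(counts):
    print(f"k = {k}: {counts[k]} alternative(s)")
print("total:", len(alternatives))
```

For K = 3 this gives one alternative with k = 1, three with k = 2, and one with k = 3, five in all. The count grows quickly with K, so exhaustive enumeration is practical only for a modest number of samples.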

Often individuals or objects have been sampled from K > 1 populations. The
data matrix may be represented in partitioned form as above. Let $n_g$ represent
the number of individuals in the g-th (random) sample, g = 1, 2, ..., K. The
$n_g$ are not restricted to being equal or proportional to other $n_g$'s. The
total number of observations is $n = \sum_{g=1}^{K} n_g$. Let $X_{gi}$ be the
$p \times 1$ vector of observations in group g = 1, 2, ..., K, and for
individual i = 1, 2, ..., $n_g$.

3.  Derivation  of  AIC  for  Two  Multivariate  Models 

3.1  AIC  for  the  Multivariate  Analysis  of  Variance  (MANOVA)  Model: 

AIC (common $\Sigma$)

We now turn our attention to situations with several multivariate
normal samples.

For example, we may have multi-sample data with sample sizes
$n_1, n_2, \ldots, n_K$ which are assumed to come from K populations, the first
with mean vector $\mu_1$ and covariance matrix $\Sigma$, the second with mean
vector $\mu_2$ and covariance matrix $\Sigma$, ..., the Kth with mean vector
$\mu_K$ and covariance matrix $\Sigma$. Therefore, throughout this section we
shall suppose that we have independent data matrices $X_1, X_2, \ldots, X_K$,
where the rows of $X_g$ $(n_g \times p)$ are independent and identically
distributed (i.i.d.) according to a multivariate normal distribution,
$N_p(\mu_g, \Sigma)$, g = 1, 2, ..., K. We may want to compare the K sample mean
vectors given that all K distributions have a common covariance matrix $\Sigma$.
This is the well known multivariate analysis of variance (MANOVA) model. In
terms of the parameters, the MANOVA model is
$\theta = (\mu_1, \mu_2, \ldots, \mu_k, \Sigma)$ with $m = kp + p(p+1)/2$
parameters, where k is the number of groups, and p is the number of variables.

We shall derive the form of AIC for this model. Recall the definition of
AIC from Section 1,

$\mathrm{AIC} = -2 \log_e L(\hat{\theta}) + 2m = -2 \log_e(\text{maximized likelihood}) + 2m,$

where m denotes the number of free parameters within the model.

Consider K normal populations with different mean vectors $\mu_g$,
g = 1, 2, ..., k, ..., K. Let $X_{gi}$, g = 1, 2, ..., K; i = 1, 2, ..., $n_g$, be a
random sample of observations from the g-th population $N_p(\mu_g, \Sigma)$. If
the groups or samples can differ only in their mean vectors, we can write the
multivariate one-way analysis of variance (MANOVA) model as

(3.1.1)  $X_{gi} = \mu_g + \varepsilon_{gi}$,  g = 1, 2, ..., K; i = 1, 2, ..., $n_g$,

where $X_{gi}$ is the $(p \times 1)$ response or outcome vector in the g-th
group for the i-th individual or object,
$\mu_g$ are vector parameters, and
$\varepsilon_{gi}$ are independent $N_p(0, \Sigma)$ random vector errors.

Thus, the basic null hypothesis we usually are interested in testing is
given by

(3.1.2)  $H_0: \mu_1 = \mu_2 = \cdots = \mu_K.$

The alternative hypothesis is given by

$H_1$: Not all $\mu_g$ are equal.

Wilks'  lambda  is  a  general  statistic  for  handling  this  problem.  Although 
there  are  several  other  conventional  statistics  for  this  purpose,  they  all  can 
be  viewed  as  special  cases  of  Wilks'  A  which  we  shall  not  discuss  here. 
For notational purposes, we shall denote by T the "total" sum of squares
and products (SSP) matrix, by W the "within-group" or "within-sample" SSP
matrix, and by B the "between-group" SSP matrix. Hence, it can be shown that

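(3.1.3)  $T = W + B,$

where

(3.1.4)  $T = \sum_{g=1}^{K} \sum_{i=1}^{n_g} (X_{gi} - \bar{X})(X_{gi} - \bar{X})',$

(3.1.5)  $W = \sum_{g=1}^{K} \sum_{i=1}^{n_g} (X_{gi} - \bar{X}_g)(X_{gi} - \bar{X}_g)',$

and

(3.1.6)  $B = \sum_{g=1}^{K} n_g (\bar{X}_g - \bar{X})(\bar{X}_g - \bar{X})',$

with

$\bar{X}_g = \frac{1}{n_g} \sum_{i=1}^{n_g} X_{gi}$, g = 1, 2, ..., K,
$\bar{X} = \frac{1}{n} \sum_{g=1}^{K} \sum_{i=1}^{n_g} X_{gi}$, and
$n = \sum_{g=1}^{K} n_g$.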
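The decomposition of the total SSP matrix into within-group and between-group parts can be checked numerically. The following is a minimal pure-Python sketch on made-up bivariate data. The helper names and the data values are assumptions introduced only for this check.

```python
def mean_vector(rows):
    """Component-wise mean of a list of equal-length observation vectors."""
    n = len(rows)
    return [sum(row[c] for row in rows) / n for c in range(len(rows[0]))]


def ssp(rows, center, weight=1.0):
    """Sum over rows of weight * (x - center)(x - center)'."""
    p = len(center)
    out = [[0.0] * p for _ in range(p)]
    for row in rows:
        d = [row[c] - center[c] for c in range(p)]
        for a in range(p):
            for b in range(p):
                out[a][b] += weight * d[a] * d[b]
    return out


def add(m1, m2):
    """Element-wise sum of two matrices stored as nested lists."""
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(m1, m2)]


# Hypothetical data: K = 3 samples, p = 2 variables, unequal sample sizes.
groups = [
    [[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]],
    [[6.0, 5.0], [7.0, 8.0]],
    [[4.0, 9.0], [5.0, 7.0], [6.0, 8.0], [5.0, 8.0]],
]

all_rows = [row for group in groups for row in group]
grand_mean = mean_vector(all_rows)
p = len(grand_mean)

T = ssp(all_rows, grand_mean)                       # total SSP
W = [[0.0] * p for _ in range(p)]                   # within-group SSP
B = [[0.0] * p for _ in range(p)]                   # between-group SSP
for group in groups:
    group_mean = mean_vector(group)
    W = add(W, ssp(group, group_mean))
    B = add(B, ssp([group_mean], grand_mean, weight=len(group)))

max_gap = max(
    abs(T[a][b] - (W[a][b] + B[a][b])) for a in range(p) for b in range(p)
)
print("largest |T - (W + B)| entry:", max_gap)
```

The largest discrepancy is at the level of floating-point rounding, as the identity requires.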


Therefore, we can present the multivariate one-way analysis of variance
(MANOVA) table as follows.

Source            d.f.   SSP matrix   Wilks' criterion
Between samples   K-1    B            $\Lambda = |W| / |T| \sim \Lambda(p;\, n-K;\, K-1)$
Within samples    n-K    W
Total             n-1    T

Now, we derive the form of Akaike's Information Criterion (AIC) for the
MANOVA model given in (3.1.1), subject to the constraint given in (3.1.2).
The likelihood function of all the sample observations is given by

(3.1.7)  $L(\{\mu_g\}, \{\Sigma_g\}; X) = \prod_{g=1}^{K} L_g(\mu_g, \Sigma_g; X_g),$

or by

(3.1.8)  $L = (2\pi)^{-np/2} \prod_{g=1}^{K} |\Sigma_g|^{-n_g/2} \exp\Big\{ -\tfrac{1}{2}\,\mathrm{tr} \sum_{g=1}^{K} \Sigma_g^{-1} A_g - \tfrac{1}{2}\,\mathrm{tr} \sum_{g=1}^{K} n_g \Sigma_g^{-1} (\bar{X}_g - \mu_g)(\bar{X}_g - \mu_g)' \Big\},$

where $n = \sum_{g=1}^{K} n_g$ and
$A_g = \sum_{i=1}^{n_g} (X_{gi} - \bar{X}_g)(X_{gi} - \bar{X}_g)'$.


The log likelihood function is

(3.1.9)  $l(\{\mu_g\}, \{\Sigma_g\}; X) = \log L(\{\mu_g\}, \{\Sigma_g\}; X)$
$= -(np/2)\log(2\pi) - \tfrac{1}{2} \sum_{g=1}^{K} n_g \log|\Sigma_g| - \tfrac{1}{2}\,\mathrm{tr} \sum_{g=1}^{K} \Sigma_g^{-1} A_g - \tfrac{1}{2}\,\mathrm{tr} \sum_{g=1}^{K} n_g \Sigma_g^{-1} (\bar{X}_g - \mu_g)(\bar{X}_g - \mu_g)'.$


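Since each $\Sigma_g$ equals the common covariance matrix $\Sigma$, the log
likelihood function becomes

(3.1.10)  $l(\{\mu_g\}, \Sigma; X) = \log L(\{\mu_g\}, \Sigma; X)$
$= -(np/2)\log(2\pi) - (n/2)\log|\Sigma| - \tfrac{1}{2}\,\mathrm{tr}\, \Sigma^{-1} \sum_{g=1}^{K} A_g - \tfrac{1}{2}\,\mathrm{tr}\, \Sigma^{-1} \sum_{g=1}^{K} n_g (\bar{X}_g - \mu_g)(\bar{X}_g - \mu_g)',$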


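and the maximum-likelihood estimates (MLE's) of $\mu_g$ and $\Sigma$ are

(3.1.11)  $\hat{\mu}_g = \bar{X}_g$,  g = 1, 2, ..., K,

and

(3.1.12)  $\hat{\Sigma} = n^{-1} W.$

Substituting these back into (3.1.10) and simplifying, the maximized log
likelihood becomes

(3.1.13)  $l(\{\hat{\mu}_g\}, \hat{\Sigma}; X) = \log L(\{\hat{\mu}_g\}, \hat{\Sigma}; X) = -(np/2)\log(2\pi) - (n/2)\log|n^{-1}W| - (np/2),$

where W is the "within-group" SSP matrix.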
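The constant $-(np/2)$ in (3.1.13) comes from the two trace terms of the log likelihood. As a short worked step, assuming the estimates $\hat{\mu}_g = \bar{X}_g$ and $\hat{\Sigma} = n^{-1}W$ with $W = \sum_g A_g$ (the estimates implied by the form of (3.1.13)):

```latex
\begin{aligned}
\operatorname{tr}\,\hat{\Sigma}^{-1}\sum_{g=1}^{K} n_g
   (\bar{X}_g-\hat{\mu}_g)(\bar{X}_g-\hat{\mu}_g)' &= 0
   && \text{since } \hat{\mu}_g=\bar{X}_g, \\
\operatorname{tr}\,\hat{\Sigma}^{-1}\sum_{g=1}^{K} A_g
   &= \operatorname{tr}\bigl(n\,W^{-1}W\bigr)
    = n\,\operatorname{tr}(I_p) = np, \\
\text{so}\quad
l(\{\hat{\mu}_g\},\hat{\Sigma};X)
   &= -\tfrac{np}{2}\log(2\pi)
      -\tfrac{n}{2}\log\lvert n^{-1}W\rvert
      -\tfrac{np}{2}.
\end{aligned}
```

The same argument applied group by group, with $\hat{\Sigma}_g = A_g/n_g$, gives $\sum_g n_g p / 2 = np/2$ in the model with varying covariance matrices.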

Since

(3.1.14)  $\mathrm{AIC} = -2\log_e L(\hat{\theta}) + 2m,$

where $m = kp + p(p+1)/2$ is the number of parameters, then AIC becomes

(3.1.15)  $\mathrm{AIC}\,(\text{common } \Sigma) = np\log_e(2\pi) + n\log_e|n^{-1}W| + np + 2[kp + p(p+1)/2].$

Since the constants do not affect the result of comparison of models, we
could ignore them and reduce the form of AIC to a much simpler form

(3.1.16)  $\mathrm{AIC}^*\,(\text{common } \Sigma) = n\log_e|W| + 2[kp + p(p+1)/2],$

where $n = \sum_{g=1}^{K} n_g$ = the total sample size,
|W| = the determinant of the "within-group" SSP matrix,
k = number of groups or samples compared,
p = number of variables.

However, for purposes of comparison we retain the constants and use
AIC (common $\Sigma$).
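A minimal pure-Python sketch of (3.1.15) follows. It assumes that, for a clustering alternative with k clusters, W is the SSP matrix about the k cluster means. The function names and the small bivariate data set are hypothetical.

```python
import math


def mean_vector(rows):
    n = len(rows)
    return [sum(row[c] for row in rows) / n for c in range(len(rows[0]))]


def within_ssp(clusters):
    """Pooled SSP matrix W about the cluster means."""
    p = len(clusters[0][0])
    w = [[0.0] * p for _ in range(p)]
    for rows in clusters:
        center = mean_vector(rows)
        for row in rows:
            d = [row[c] - center[c] for c in range(p)]
            for i in range(p):
                for j in range(p):
                    w[i][j] += d[i] * d[j]
    return w


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    size = len(a)
    det = 1.0
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, size):
            factor = a[r][col] / a[col][col]
            for c in range(col, size):
                a[r][c] -= factor * a[col][c]
    return det


def aic_common(clusters):
    """AIC (common Sigma) as in (3.1.15) for one clustering alternative."""
    n = sum(len(rows) for rows in clusters)
    p = len(clusters[0][0])
    k = len(clusters)
    sigma_hat = [[v / n for v in row] for row in within_ssp(clusters)]
    m = k * p + p * (p + 1) // 2
    return (n * p * math.log(2 * math.pi)
            + n * math.log(determinant(sigma_hat)) + n * p + 2 * m)


# Hypothetical well-separated samples with p = 2 variables.
sample_a = [[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [2.0, 4.0]]
sample_b = [[8.0, 9.0], [9.0, 7.0], [10.0, 10.0], [9.0, 11.0]]

separate = aic_common([sample_a, sample_b])
merged = aic_common([sample_a + sample_b])
print(f"(A)(B): {separate:.3f}   (A, B): {merged:.3f}")
```

For these made-up samples the separated alternative has the smaller AIC, so the extra p mean parameters are judged worthwhile.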
3.2 AIC for the Multivariate Model with Varying Parameters:

AIC (varying $\mu$ and $\Sigma$)

As we mentioned in Section 1, the assumption of equality of covariance
matrices in MANOVA can cause serious problems. For this reason we may want to
first test the equality of covariance matrices against the alternative that
not all covariance matrices are equal, given no restriction on the population
mean vectors. Therefore, throughout this section we shall suppose that we
have independent data matrices $X_1, X_2, \ldots, X_K$, where the rows of $X_g$
$(n_g \times p)$ are independent and identically distributed (i.i.d.)
$N_p(\mu_g, \Sigma_g)$, g = 1, 2, ..., K. In terms of the parameters with varying
mean vectors and covariance matrices, the multivariate model we shall consider
is

$\theta = (\mu_1, \mu_2, \ldots, \mu_k, \Sigma_1, \Sigma_2, \ldots, \Sigma_k)$

with $m = kp + kp(p+1)/2$ parameters, where k is the number of groups, and p is
the number of variables.
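The two parameter counts, $m = kp + p(p+1)/2$ for the common-covariance model and $m = kp + kp(p+1)/2$ for the model with varying covariance matrices, can be tabulated directly. The function names below are illustrative, and p = 4 is chosen to match the four iris measurements used later.

```python
def n_params_common(k, p):
    """k mean vectors plus one symmetric p x p covariance matrix."""
    return k * p + p * (p + 1) // 2


def n_params_varying(k, p):
    """k mean vectors plus k symmetric p x p covariance matrices."""
    return k * p + k * p * (p + 1) // 2


p = 4
penalties = {
    k: (2 * n_params_common(k, p), 2 * n_params_varying(k, p))
    for k in (1, 2, 3)
}
for k, (two_m_common, two_m_varying) in penalties.items():
    print(f"k = {k}: 2m = {two_m_common} (common), {two_m_varying} (varying)")
```

The two counts coincide for k = 1 and diverge as k grows, since each additional cluster costs p parameters under the common model and p + p(p+1)/2 under the varying model.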

Thus, the basic null hypothesis we usually are interested in testing is
given by

(3.2.1)  $H_0: \Sigma_1 = \Sigma_2 = \cdots = \Sigma_K.$

The alternative hypothesis is given by

$H_1$: Not all K covariance matrices are equal.

In multivariate analysis this is known as the test of homogeneity of
covariance matrices.

To derive Akaike's Information Criterion (AIC) in this case, the log
likelihood function is given by

(3.2.2)  $l(\{\mu_g, \Sigma_g\}; X) = \log L(\{\mu_g, \Sigma_g\}; X)$
$= -(np/2)\log(2\pi) - \tfrac{1}{2} \sum_{g=1}^{K} n_g \log|\Sigma_g| - \tfrac{1}{2} \sum_{g=1}^{K} \mathrm{tr}\, \Sigma_g^{-1} A_g - \tfrac{1}{2} \sum_{g=1}^{K} n_g (\bar{X}_g - \mu_g)' \Sigma_g^{-1} (\bar{X}_g - \mu_g).$


The MLE's of $\mu_g$ and $\Sigma_g$ are

(3.2.3)  $\hat{\mu}_g = \bar{X}_g$,  g = 1, 2, ..., K,

and

(3.2.4)  $\hat{\Sigma}_g = A_g / n_g.$

Substituting these back into (3.2.2) and simplifying, the maximized log
likelihood becomes

(3.2.5)  $l(\{\hat{\mu}_g, \hat{\Sigma}_g\}; X) = \log L(\{\hat{\mu}_g, \hat{\Sigma}_g\}; X) = -(np/2)\log(2\pi) - \tfrac{1}{2} \sum_{g=1}^{K} n_g \log|n_g^{-1} A_g| - (np/2).$


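Since

(3.2.6)  $\mathrm{AIC} = -2\log_e L(\hat{\theta}) + 2m,$

where $m = kp + kp(p+1)/2$ is the number of parameters, then AIC becomes

(3.2.7)  $\mathrm{AIC}\,(\text{varying } \mu \text{ and } \Sigma) = np\log_e(2\pi) + \sum_{g=1}^{K} n_g \log_e|n_g^{-1} A_g| + np + 2[kp + kp(p+1)/2].$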


Since the constants do not affect the result of comparison of models, we
could ignore them and reduce the form of AIC to a much simpler form

(3.2.8)  $\mathrm{AIC}^*\,(\text{varying } \mu \text{ and } \Sigma) = \sum_{g=1}^{K} n_g \log_e|A_g| + 2[kp + kp(p+1)/2],$

where $n_g$ = sample size of group or sample g = 1, 2, ..., K,
$|A_g|$ = the determinant of the sum of squares and cross-products (SSCP)
matrix for group or sample g = 1, 2, ..., K,
k = number of groups or samples compared, and
p = number of variables.

However, for purposes of comparison we retain the constants and use AIC given
by (3.2.7).
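The full form with constants retained can be sketched as a small function. It assumes the per-cluster quantities $\log_e|n_g^{-1}A_g|$ have already been computed elsewhere. The function name and the numerical inputs are hypothetical.

```python
import math


def aic_varying(cluster_sizes, log_dets, p):
    """AIC (varying mu and Sigma) with constants retained.

    cluster_sizes: n_g for each of the k clusters
    log_dets:      log_e |A_g / n_g| for each cluster (assumed precomputed)
    p:             number of variables
    """
    n = sum(cluster_sizes)
    k = len(cluster_sizes)
    m = k * p + k * p * (p + 1) // 2
    fit = sum(n_g * log_det for n_g, log_det in zip(cluster_sizes, log_dets))
    return n * p * math.log(2 * math.pi) + fit + n * p + 2 * m


# Hypothetical alternative: two clusters of 50 and 100 observations, p = 4.
value = aic_varying([50, 100], [-11.0, -7.0], p=4)
print(f"AIC (varying) = {value:.3f}")
```

Keeping the log-determinants as inputs separates the bookkeeping of the penalty term from the numerical linear algebra, which can then be done with whatever routine is at hand.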


4.  Numerical  Examples  of  Multi-Sample  Cluster  Analysis  on  Real  Data  Sets 

In this section we shall give two different numerical examples of
multi-sample cluster analysis, cluster the samples, and choose the best
clusterings by using Akaike's Information Criterion (AIC) as derived in
Sections 3.1 and 3.2. In Example 4.2 we shall also present the numerical
results of using the w-square criterion as an alternative approach. We shall
briefly discuss the relative merits of AIC over the w-square criterion. One
should note that these criteria are qualitatively and quantitatively different.

Our computations for all the examples we shall present here were carried
out on an IBM 4341, configured as a 370, by using statistical library software
newly developed by the first author for multi-sample cluster analysis using
AIC, called MSCA.AIC.

We shall illustrate our results first on the Fisher [5] iris data.

Example 4.1. Clustering of Irises by Groups: The iris data set is composed
of 150 iris plants belonging to three groups or species, namely Iris setosa
(S), Iris versicolor (Ve), and Iris virginica (Vi), measured on sepal and petal
length and width. Each group is represented by 50 plants.

This data set has been quite extensively studied in classification and
cluster analysis since it was published by Fisher [5], and it is still used
today as a "testing ground" for classification and clustering methods
proposed by many investigators such as Friedman and Rubin [6], Kendall [8],
Solomon [15], Mezzich and Solomon [13], and many others, including the present
authors.

For each of the 150 plants we already know the group structure of the
iris species, namely K = 3 groups or samples. Even though the two species Iris
setosa and Iris versicolor were found growing in the same colony, and Iris
virginica was found growing in a different colony, Fisher reports in his
linear discriminant analysis the separation of I. setosa completely from I.
versicolor and I. virginica. Since then other investigators, such as the ones
we mentioned above, have shown similar results in their studies.

With this in mind, we cluster K = 3 samples (species) into k = 1, 2, and 3
groups on the basis of all four variables. We obtain in total five
possible clustering alternatives. (In general, the number of ways of
clustering K samples into k groups is a Stirling Number of the Second Kind;
see, e.g., Abramowitz and Stegun [1].) Denoting I. setosa by S, I. versicolor
by Ve, and I. virginica by Vi, we have (S) (Ve) (Vi), (S, Ve) (Vi),
(S, Vi) (Ve), (Ve, Vi) (S), and (S, Ve, Vi) as the possible clustering
alternatives. Using the MANOVA model and the multivariate model with varying
parameters discussed in Sections 3.1 and 3.2 as our underlying models for
clustering these three iris species, we obtained the following results.

TABLE 4.1. THE AIC'S FOR IRISES BY GROUPS ON ALL VARIABLES UNDER MANOVA MODEL

Alternative  Clustering      $np\log_e(2\pi)$  $n\log_e|n^{-1}W|$  np   k  2m  AIC (common $\Sigma$)
1            (S) (Ve) (Vi)   1,102.724         -1,504.2            600  3  44  242.524 a
2            (S, Ve) (Vi)    1,102.724         -1,085.9            600  2  36  652.824
3            (S, Vi) (Ve)    1,102.724           -988.39           600  2  36  750.334
4            (Ve, Vi) (S)    1,102.724         -1,299.6            600  2  36  439.124 b
5            (S, Ve, Vi)     1,102.724           -941.73           600  1  28  788.994

n = 150 plants, p = 4 variables
m = kp + p(p+1)/2 parameters
AIC (common $\Sigma$) = $np\log_e(2\pi) + n\log_e|n^{-1}W| + np + 2m$
a First Minimum AIC
b Second Minimum AIC
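The AIC column of Table 4.1 can be reassembled from its components as an arithmetic check. The sketch below uses the values printed in the table, including the printed constant 1,102.724 for the first term. The dictionary layout and function name are illustrative.

```python
n, p = 150, 4
CONSTANT_TERM = 1102.724  # n p log_e(2 pi), as printed in Table 4.1

# Clustering alternative -> (n log_e |W / n| as printed, number of clusters k)
table_rows = {
    "(S) (Ve) (Vi)": (-1504.2, 3),
    "(S, Ve) (Vi)": (-1085.9, 2),
    "(S, Vi) (Ve)": (-988.39, 2),
    "(Ve, Vi) (S)": (-1299.6, 2),
    "(S, Ve, Vi)": (-941.73, 1),
}


def aic_common_from_table(fit_term, k):
    """Sum the four terms of AIC (common Sigma) for one table row."""
    m = k * p + p * (p + 1) // 2
    return CONSTANT_TERM + fit_term + n * p + 2 * m


aic_values = {
    name: round(aic_common_from_table(fit, k), 3)
    for name, (fit, k) in table_rows.items()
}
ranked = sorted(aic_values, key=aic_values.get)

for name in ranked:
    print(f"{name:15s} {aic_values[name]:9.3f}")
```

The ranking puts (S) (Ve) (Vi) first and (Ve, Vi) (S) second, in agreement with the footnotes of the table.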

TABLE 4.2. THE AIC'S FOR IRISES BY GROUPS ON ALL VARIABLES UNDER THE MODEL WITH
VARYING PARAMETERS

Alternative  Clustering      $np\log_e(2\pi)$  $\sum_{g=1}^{k} n_g\log_e|n_g^{-1}A_g|$  np   k  2m  AIC (varying $\mu$ and $\Sigma$)
1            (S) (Ve) (Vi)   1,102.724         -1,653.895                               600  3  84  132.829 a
2            (S, Ve) (Vi)    1,102.724         -1,251.675                               600  2  56  507.049
3            (S, Vi) (Ve)    1,102.724         -1,144.480                               600  2  56  614.244
4            (Ve, Vi) (S)    1,102.724         -1,463.770                               600  2  56  294.964 b
5            (S, Ve, Vi)     1,102.724           -941.580                               600  1  28  789.144

n = 150 plants, p = 4 variables
m = kp + kp(p+1)/2 parameters
AIC (varying $\mu$ and $\Sigma$) = $np\log_e(2\pi) + \sum_{g=1}^{k} n_g\log_e|n_g^{-1}A_g| + np + 2m$
a First Minimum AIC
b Second Minimum AIC


Looking at Tables 4.1 and 4.2, we see that, using all four variables
simultaneously under both models, the MAICE clustering is (S) (Ve) (Vi). This
indicates that indeed there are three types of species. Not surprisingly, the
second minimum AIC occurs at the alternative submodel 4, (Ve, Vi) (S),
under both models, telling us that if we were to cluster any two of the
iris groups, we should cluster I. versicolor and I. virginica together as one
homogeneous group, and we should cluster I. setosa completely separately. We
note that the AIC values under submodels 2 and 3 are quite large, indicating the
inferiority of these submodels. We can see the effect of clustering I. setosa
with I. versicolor in submodel 2, and also with I. virginica in submodel 3,
by comparing the difference of AIC's in these submodels with that of submodel 4,
in which I. versicolor and I. virginica were clustered together and I. setosa
was clustered as a separate cluster on its own. According to AIC, we never
cluster the three iris species as one homogeneous group (submodel 5). Again, by
comparing the differences of AIC's of submodel 5 with those of submodels 4, 3,
and 2, respectively, we can measure the amount of heterogeneity contributed by
I. setosa, I. versicolor, and I. virginica, respectively, in each clustering
alternative under the MANOVA model and the multivariate model with varying mean
vectors and covariance matrices. The larger this difference, the greater the
heterogeneity or separation of that group or sample from the homogeneous
groups or samples in each clustering alternative.

In comparing the AIC's in Tables 4.1 and 4.2, we further notice that the
AIC (varying $\mu$ and $\Sigma$) values are much less than the AIC (common
$\Sigma$) values for each of the clustering alternatives except for the last
clustering alternative (i.e., alternative 5) in clustering the iris groups or
species. Since, according to the definition of AIC, the model with the minimum
AIC is chosen to be the best model, the above results suggest that when we are
clustering the iris data, and in general, we should use different covariance
matrices rather than equal covariance matrices.
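The model comparison described above can be reproduced from the AIC columns of Tables 4.1 and 4.2. The sketch below simply compares the printed values alternative by alternative. The variable names are illustrative.

```python
# AIC values as printed in Tables 4.1 and 4.2, keyed by alternative number.
aic_common = {1: 242.524, 2: 652.824, 3: 750.334, 4: 439.124, 5: 788.994}
aic_varying = {1: 132.829, 2: 507.049, 3: 614.244, 4: 294.964, 5: 789.144}

preferred = {
    alt: "varying" if aic_varying[alt] < aic_common[alt] else "common"
    for alt in aic_common
}
reduction = {alt: aic_common[alt] - aic_varying[alt] for alt in aic_common}

for alt in sorted(aic_common):
    print(f"alternative {alt}: {preferred[alt]:7s} "
          f"(common - varying = {reduction[alt]:8.3f})")
```

Alternatives 1 through 4 favor the model with varying covariance matrices by more than 100 AIC units each, while for alternative 5 the two values differ by only 0.15.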

We now present our results on the iris data by using the w-square
criterion given by (1.1) in Section 1, when we assume equal covariance matrices
between the iris groups or species. We should note here that in the w-square
criterion given by (1.1) and in Mardia et al. ([12], p. 367), the estimated
pooled within-groups covariance matrix $\hat{\Sigma}$ is computed only once
across all the groups or samples to be clustered, regardless of the number of
clustering alternatives. In our version of the w-square criterion we follow the
same procedure, but we recompute the estimate of $\Sigma$ in each clustering
alternative when we vary the number of clusters of groups or samples, k, given
K, the number of groups or samples to be clustered. We do this both under the
assumption of equal covariance matrices and under that of separate covariance
matrices between the iris groups. Therefore, our numerical values of the
w-square criterion are quite different from those of the original w-square
criterion given in Mardia et al. [12], despite the fact that we get the same
results.

We  give  the  computational  results  as  follows. 

TABLE 4.3. THE VALUES OF $w_g^2$ FOR IRISES BY GROUPS ON ALL VARIABLES

Alternative  Clustering      $w_g^2$ (common $\Sigma$) a  $w_g^2$ (common $\Sigma$) b  $w_g^2$ (varying $\Sigma$) c
1            (S) (Ve) (Vi)   *                            *                            *
2            (S, Ve) (Vi)    2246.6046                    137.9722                     94.4149
3            (S, Vi) (Ve)    4484.6178                    142.3345                     96.0706
4            (Ve, Vi) (S)    430.0267**                   109.5511**                   76.8212**
5            (S, Ve, Vi)     4774.1661                    175.2091                     176.2091

n = 150 plants, p = 4 variables
a Original $w_g^2$ given in (1.1)
b, c Our version of $w_g^2$
* $w_g^2$ cannot be computed (always equal to zero)
** Minimum of $w_g^2$

Hence, we interpret the results in Table 4.3 in the same manner as we did
for AIC's. We see that at the alternative submodel 1, w_g^2 cannot be computed
and is always equal to zero when the iris groups are clustered as singletons.
This is always the case in general. Certainly this is a definite disadvantage of
w_g^2 as compared to AIC, which has a value even if the iris groups are clustered
as singletons, so that AIC can aid us in determining and understanding the amount
of heterogeneity or separation of the groups on a unique scale. The minimum of
w_g^2 occurs at the alternative submodel 4, telling us again that, if we were to
cluster any two of the three iris groups, we should cluster I. versicolor and
I. virginica together as one homogeneous group, and we should cluster I. setosa
completely separate as one heterogeneous group.

In short, the w-square criterion gives the same results as AIC does, but, as we
mentioned in Section 1, it does not make any allowance for m, the number of
parameters estimated within the clustering alternatives. AIC makes such an
allowance to achieve parsimony when we compare the "goodness of fit" of various
models, as we do in comparing different clustering alternatives. The w-square
criterion lacks this important feature. Also, as we saw, when we have singleton
clusters, it cannot be computed. Therefore, in our next example, we shall only
give our results on AIC, since our purpose is to introduce AIC in this paper as a
new approach to be used in evaluating multi-sample clusters.


Example 4.2. Clustering Graduate Students by Their Classification Groups:

A data set for applicants for admission to a Graduate School of Business, given in
Johnson and Wichern ([9], p. 528), is composed of data for 85 applicants who were
classified by the admissions officer as Admit (A), Not Admit (NA), and Borderline (B),
based on undergraduate grade point average (GPA) and graduate management aptitude test
(GMAT) scores. The group sizes are n1 = 31, n2 = 28, and n3 = 26 applicants.

With this in mind, we cluster K = 3 groups of applicants into k = 1, 2, and 3
homogeneous groups on the basis of the two variables. Using the MANOVA model and the
multivariate model with varying parameters, our results are as follows.
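
The five clustering alternatives considered for K = 3 groups are simply all the
ways of splitting the three groups into non-empty clusters. As an illustration
only, a minimal Python sketch that enumerates them is given below. The function
name set_partitions is our own hypothetical helper, and the order in which it
lists the two-cluster alternatives is arbitrary and need not match the numbering
used in the tables.

```python
def set_partitions(items):
    """Yield every way to split `items` into non-empty clusters."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing cluster in turn ...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ... or let it start a new cluster of its own.
        yield [[first]] + partition


groups = ["A", "NA", "B"]  # Admit, Not Admit, Borderline
# List the alternatives from k = 3 clusters down to k = 1 cluster.
alternatives = sorted(set_partitions(groups), key=len, reverse=True)
for number, clusters in enumerate(alternatives, start=1):
    print(number, "k =", len(clusters), clusters)
```

For larger K the number of alternatives grows quickly, so a complete enumeration
of this kind is practical only for a small number of groups.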


TABLE 4.4. THE AIC'S FOR APPLICANTS BY THEIR CLASSIFICATION GROUPS UNDER
MANOVA MODEL

Alternative  Clustering    np log_e(2π)  n log_e|n^{-1}W|  np   k  2m  AIC (common Σ)
1            (A) (NA) (B)  312.4391      406.1716          170  3  18   906.6107^a
2            (A,NA) (B)    312.4391      566.7477          170  2  14  1063.1868
3            (A,B) (NA)    312.4391      491.7043          170  2  14   988.1434^c
4            (B,NA) (A)    312.4391      474.0420          170  2  14   970.4811^b
5            (A,B,NA)      312.4391      581.9931          170  1  10  1074.4322

n = 85 applicants, p = 2 variables
m = kp + p(p+1)/2 parameters
AIC (common Σ) = np log_e(2π) + n log_e|n^{-1}W| + np + 2m
a  First Minimum AIC
b  Second Minimum AIC
c  Third Minimum AIC
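
The AIC column of Table 4.4 can be checked directly from the other columns. The
short Python sketch below is an illustration only: it takes the reported term
n log_e|n^{-1}W| as an input rather than recomputing it from the raw data, and
the function name aic_common is our own hypothetical choice.

```python
import math


def aic_common(n, p, k, n_log_det):
    """AIC (common covariance) from the reported n*log|W/n| term."""
    m = k * p + p * (p + 1) // 2  # k mean vectors plus one covariance matrix
    return n * p * math.log(2 * math.pi) + n_log_det + n * p + 2 * m


# (clustering, k, n*log_e|n^-1 W|) as reported in Table 4.4
table_4_4 = [
    ("(A) (NA) (B)", 3, 406.1716),
    ("(A,NA) (B)", 2, 566.7477),
    ("(A,B) (NA)", 2, 491.7043),
    ("(B,NA) (A)", 2, 474.0420),
    ("(A,B,NA)", 1, 581.9931),
]

for label, k, term in table_4_4:
    print(f"{label:14s} k={k}  AIC={aic_common(85, 2, k, term):.4f}")
```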


TABLE 4.5. THE AIC'S FOR APPLICANTS BY THEIR CLASSIFICATION GROUPS
UNDER THE MODEL WITH VARYING PARAMETERS

Alternative  Clustering    np log_e(2π)  Σ n_g log_e|n_g^{-1}A_g|  np   k  2m  AIC (varying μ and Σ)
1            (A) (NA) (B)  312.4391      388.7472                  170  3  30   901.1863^a
2            (A,NA) (B)    312.4391      509.4198                  170  2  20  1011.8589
3            (A,B) (NA)    312.4391      480.2378                  170  2  20   982.6769^c
4            (B,NA) (A)    312.4391      465.7116                  170  2  20   968.1507^b
5            (A,B,NA)      312.4391      581.9931                  170  1  10  1074.4322

n = 85 applicants, p = 2 variables
m = kp + kp(p+1)/2 parameters
AIC (varying μ and Σ) = np log_e(2π) + Σ_{g=1}^{k} n_g log_e|n_g^{-1}A_g| + np + 2m
a  First Minimum AIC
b  Second Minimum AIC
c  Third Minimum AIC
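
As with Table 4.4, the entries of Table 4.5 can be reproduced from the reported
terms, and sorting them gives the first, second, and third minimum AIC. The
Python sketch below is an illustration only: the summed log-determinant term is
taken from the table rather than recomputed from the data, and aic_varying is a
hypothetical name of our own.

```python
import math


def aic_varying(n, p, k, sum_n_log_det):
    """AIC (varying means and covariances) from the reported summed term."""
    m = k * p + k * p * (p + 1) // 2  # k mean vectors and k covariance matrices
    return n * p * math.log(2 * math.pi) + sum_n_log_det + n * p + 2 * m


# clustering -> (k, sum over clusters of n_g*log_e|n_g^-1 A_g|), from Table 4.5
table_4_5 = {
    "(A) (NA) (B)": (3, 388.7472),
    "(A,NA) (B)": (2, 509.4198),
    "(A,B) (NA)": (2, 480.2378),
    "(B,NA) (A)": (2, 465.7116),
    "(A,B,NA)": (1, 581.9931),
}

aic = {label: aic_varying(85, 2, k, term) for label, (k, term) in table_4_5.items()}
ranking = sorted(aic, key=aic.get)  # smallest AIC first
for rank, label in enumerate(ranking, start=1):
    print(rank, label, round(aic[label], 4))
```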


Hence, looking at Tables 4.4 and 4.5, we see that, under both models, the first
minimum AIC occurs at the alternative submodel 1, that is, when (A) (NA) (B) are all
clustered separately. This indicates that there are indeed three groups of applicants.
Therefore, the MAICE is submodel 1. The second minimum AIC occurs at the alternative
submodel 4, again under both models, telling us that if we were to cluster any two of
the three groups, then we should cluster the Borderline (B) and Not Admit (NA) groups
together as one homogeneous group, and we should cluster the Admit (A) group completely
separate as one heterogeneous group. On the other hand, if we want to make a third
choice, then the third minimum of AIC occurs at the alternative submodel 3, indicating
the closeness of the Admit (A) group to the Borderline (B) group as one homogeneous
cluster, and leaving the Not Admit (NA) group on its own as a singleton cluster.
In this way, we can check the significance of each of the clustering
alternatives in the decision making process. In this example, we also never
cluster the three groups as one homogeneous group (submodel 5).

In comparing the AIC's in Tables 4.4 and 4.5 for this example, we also
notice that the AIC (varying μ and Σ) values are less than the AIC (common Σ)
values for each of the clustering alternatives except for the last clustering
alternative (i.e., alternative 5), where the two values coincide. These
results suggest that we should use different covariance matrices. However,
the values of AIC (varying μ and Σ) and AIC (common Σ) are so close to one
another that, if we were to assume equal covariance matrices between the
applicant groups a priori, it would not have been a dubious assumption for
this particular data set.

Thus, it should be noted that via AIC we can now easily check the validity
of our assumptions in terms of using equal covariance matrices as opposed to
separate covariance matrices in a particular data set, which is important in the
multi-sample clustering situation, and in general.
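
To make this check concrete, the following Python sketch computes both AIC
values for a given clustering directly from data. It is a sketch under stated
assumptions, not the program used for our tables: we assume that A_g is the
sum-of-squares-and-cross-products matrix of cluster g about its own mean and
that W is the sum of the A_g; the helper names (sscp, log_det, aic_pair) are
hypothetical, and the two small clusters are made-up toy points, not the
applicant data.

```python
import math


def sscp(rows):
    """Sum-of-squares-and-cross-products matrix about the cluster mean."""
    n = len(rows)
    mu = [sum(col) / n for col in zip(*rows)]
    p = len(mu)
    out = [[0.0] * p for _ in range(p)]
    for row in rows:
        d = [x - m for x, m in zip(row, mu)]
        for i in range(p):
            for j in range(p):
                out[i][j] += d[i] * d[j]
    return out


def log_det(matrix):
    """log|matrix| for a positive-definite matrix via Gaussian elimination."""
    a = [row[:] for row in matrix]
    total = 0.0
    for i in range(len(a)):
        pivot = a[i][i]
        total += math.log(pivot)  # raises ValueError if the matrix is singular
        for r in range(i + 1, len(a)):
            factor = a[r][i] / pivot
            for c in range(i, len(a)):
                a[r][c] -= factor * a[i][c]
    return total


def aic_pair(clusters):
    """Return (AIC common covariance, AIC varying covariances)."""
    n = sum(len(c) for c in clusters)
    p = len(clusters[0][0])
    k = len(clusters)
    base = n * p * math.log(2 * math.pi) + n * p
    mats = [sscp(c) for c in clusters]
    # Assumed pooled within-cluster matrix W = A_1 + ... + A_k.
    w = [[sum(a[i][j] for a in mats) for j in range(p)] for i in range(p)]
    common = (base + n * log_det([[v / n for v in row] for row in w])
              + 2 * (k * p + p * (p + 1) // 2))
    varying = (base
               + sum(len(c) * log_det([[v / len(c) for v in row] for row in a])
                     for c, a in zip(clusters, mats))
               + 2 * (k * p + k * p * (p + 1) // 2))
    return common, varying


# Made-up toy clusters in p = 2 dimensions (hypothetical data).
c1 = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
c2 = [(5.0, 5.0), (7.0, 5.0), (5.0, 9.0), (7.0, 9.0)]
print(aic_pair([c1, c2]))   # two clusters kept separate
print(aic_pair([c1 + c2]))  # everything in one cluster
```

When all samples are placed in a single cluster the two values are identical,
which agrees with alternative 5 in Tables 4.4 and 4.5.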

5.  Conclusions  and  Discussion 

From  our  numerical  results  in  Section  4,  we  see  that  AIC  and  consequently 
minimum  AIC  procedures  can  indeed  successfully  identify  the  best  clustering 
alternatives  when  we  cluster  samples  into  homogeneous  sets  of  samples  both  in 
the  MANOVA  model  and  the  multivariate  model  with  varying  covariance  matrices. 

We can measure the amount of homogeneity and heterogeneity in clustering
samples.  We  can  determine  a  priori  whether  we  should  use  equal  or  varying 
covariance  matrices  in  the  analysis  of  a  data  set. 


The fact that AIC does not require the table look-up needed in
conventional procedures adds to the importance of the results obtained. This
is one of the important virtues of AIC: it breaks away from conventional
procedures, which try to test whether a parameter is "significant" or not using a
significance level α which is essentially arbitrary. The other important
virtue of AIC is that the penalty represented by the term 2 × (number of free
parameters) clearly demonstrates the necessity of choosing a class of models,
at least one of which will be able to provide a good approximation to the
distribution of data without adjusting too many parameters.

Thus, in concluding, we see that the use of AIC shows how to combine the
information in the likelihood with an appropriate function of the number of
parameters to obtain estimates of the information provided by competing
alternative models. Therefore, the definition of MAICE gives a clear
formulation of the principle of parsimony in statistical model building or
comparison, as we demonstrated by numerical examples. MAICE also provides a
versatile procedure for statistical model identification which is free from
the ambiguities inherent in the application of conventional statistical
procedures.